Make partitions loading for iceberg table lazy to avoid unnecessary loading #23645
Description
Currently, when querying Iceberg tables, we always eagerly load all partition values of the table in `PickTableLayout` or `IcebergFilterPushdown` during the optimization phase. Because these eagerly loaded partition values are currently used only by metadata-based optimization rules, in many cases the information is not used at all, which can waste considerable resources and hurt performance.

This PR makes partition loading lazy and supports setting a threshold for the maximum number of partitions that can be loaded during the loading phase. This avoids unnecessary loading in many scenarios, along with the resource consumption and performance loss it causes.
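The lazy-loading idea with a threshold can be illustrated with a minimal sketch. This is not the code from this PR: the class `LazyPartitionLoader`, its methods, and the use of plain strings for partition values are all hypothetical, and the sketch assumes that a table whose partition count equals the threshold is still loaded. It only shows the general pattern: nothing is scanned until a rule asks for the partitions, the scan is abandoned as soon as the limit is exceeded, and the outcome is cached.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical illustration only; not the classes used in the PR.
final class LazyPartitionLoader {
    private final Supplier<Iterable<String>> partitionSource; // expensive scan, deferred
    private final int maxPartitions;                          // assumed threshold semantics
    private boolean attempted;
    private Optional<List<String>> result = Optional.empty();
    private int scanCount;

    LazyPartitionLoader(Supplier<Iterable<String>> partitionSource, int maxPartitions) {
        this.partitionSource = partitionSource;
        this.maxPartitions = maxPartitions;
    }

    // Runs the scan on first call only; later calls return the cached outcome.
    Optional<List<String>> getPartitions() {
        if (!attempted) {
            attempted = true;
            scanCount++;
            result = load();
        }
        return result;
    }

    // Returns empty as soon as more than maxPartitions values are seen.
    private Optional<List<String>> load() {
        List<String> loaded = new ArrayList<>();
        for (String partition : partitionSource.get()) {
            if (loaded.size() == maxPartitions) {
                return Optional.empty();
            }
            loaded.add(partition);
        }
        return Optional.of(Collections.unmodifiableList(loaded));
    }

    // Number of times the underlying source was actually scanned.
    int scanCount() {
        return scanCount;
    }
}
```

In this sketch, a caller that never needs partition values never triggers the scan, and a caller that does need them receives an empty result when the table has too many partitions, which it can treat as "skip the metadata-based optimization".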
The benchmark results also support this conclusion. We ran regular queries, queries eligible for metadata optimization, and queries eligible for further reducible optimization against tables with two different partition counts. Among them, `300 * 3` makes the table contain 900 partitions (below the default threshold of 1000), while `400 * 4` makes the table contain 1600 partitions (above the default threshold of 1000). The code can be viewed here: https://github.com/hantangwangd/presto/blob/benchmark_for_lazy_load/presto-iceberg/src/test/java/com/facebook/presto/iceberg/BenchmarkIcebergQuery.java

The benchmark results before this change are as follows:
The benchmark results after this change are as follows:
Given the issues described above, this change significantly improves the performance of queries that are not eligible for metadata optimization. For statements that can be optimized based on metadata, performance also improves when the number of partitions exceeds the threshold. This is in line with expectations, and the improvement can be expected to grow as the number of partitions increases.
Motivation and Context
Make partition loading for Iceberg tables lazy to avoid unnecessary loading.
Impact
N/A
Test Plan
`TestIcebergLogicalPlanner` to show the behaviors with different max partition thresholds
Contributor checklist
Release Notes